Data Science with R and Stata

This webpage provides the materials for a presentation given during a campus visit at Hamilton College on May 5, 2023.

Please feel free to contact me at thlim@hamilton.edu if you have any questions.

Stata

General Information

Remember

  • to use do-files
  • to back up, back up, and then back up
  • help, Google, and ChatGPT are your friends, when you use them right!
  • to check data types when you hit an error

You can download the do-file here:

Download data_science_for_hamiltonians.do

Snowfall Data

//load data set with the second row as the variable names
import delimited "G:\Dropbox\Hamilton Campus Visit\NY-snowfall-202301.csv", varnames(2) clear
(38 vars, 469 obs)
tab jan1
      Jan 1 |      Freq.     Percent        Cum.
------------+-----------------------------------
        0.0 |        254       54.16       54.16
          M |        214       45.63       99.79
          T |          1        0.21      100.00
------------+-----------------------------------
      Total |        469      100.00
// Replace "M" (missing) and "T" (trace) with "0" in the January snowfall columns
// (subinstr replaces the character anywhere in the string; that is safe here
//  because each cell contains only a number, "M", or "T")
foreach var of varlist jan* {
    replace `var' = subinstr(`var', "M", "0", .)
    replace `var' = subinstr(`var', "T", "0", .)
}

(214 real changes made)
(output omitted: a similar pair of "real changes made" messages for each remaining jan* variable)

tab jan1
      Jan 1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        215       45.84       45.84
        0.0 |        254       54.16      100.00
------------+-----------------------------------
      Total |        469      100.00
// Check data type
describe jan1
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
jan1            str3    %9s                   Jan 1
// Convert columns starting with Jan. to numeric
foreach var of varlist jan* {
    destring `var', replace
}

jan1 has all characters numeric; replaced as byte
(output omitted: similar "replaced as byte/double" messages for jan2–jan31)

// Check data type
describe jan1
tab jan1
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
jan1            byte    %10.0g                Jan 1

      Jan 1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        469      100.00      100.00
------------+-----------------------------------
      Total |        469      100.00
// Add a new column with the sum of each row
egen total = rowtotal(jan1 - jan31)

sum total,d
                            total
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 469
25%          1.1              0       Sum of Wgt.         469

50%          7.5                      Mean           7.889765
                        Largest       Std. Dev.      6.822664
75%         11.8           29.7
90%         16.4           31.4       Variance       46.54874
95%         19.4           32.9       Skewness       .8291596
99%         29.5           33.8       Kurtosis       3.733055
// Check if Elevation and Latitude are numeric
describe elevation latitude
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
elevation       int     %8.0g                 Elevation
latitude        float   %9.0g                 Latitude
sum elevation, d
                          Elevation
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5          -9999
 5%           30              3
10%          100              4       Obs                 469
25%          380              4       Sum of Wgt.         469

50%          642                      Mean           743.5416
                        Largest       Std. Dev.      723.4853
75%         1110           2029
90%         1568           2169       Variance         523431
95%         1698           2220       Skewness      -6.733887
99%         2021           2495       Kurtosis       104.8178
tab stationname if elevation == -9999
                  Station Name |      Freq.     Percent        Cum.
-------------------------------+-----------------------------------
                 BATAVIA 1.4 E |          1      100.00      100.00
-------------------------------+-----------------------------------
                         Total |          1      100.00
// recode the -9999 missing-data sentinel to a plausible elevation for this station
recode elevation -9999=900
(elevation: 1 changes made)

Plot Correlation

twoway scatter total elevation, ///
  xtitle("Elevation") ytitle("Total") ///
  title("Scatterplot of Total vs. Elevation")

quietly graph export scatter_total_elevation.png, replace

Total ~ Elevation

twoway (scatter total latitude) (lfit total latitude), ///
  xtitle("Latitude") ytitle("Total") ///
  title("Scatterplot of Total vs. Latitude")
//quietly graph export lfit_total_latitude.png, replace

Total ~ Latitude

Linear Regression

reg total elevation latitude
      Source |       SS           df       MS      Number of obs   =       469
-------------+----------------------------------   F(2, 466)       =    197.88
       Model |  10004.6003         2  5002.30013   Prob > F        =    0.0000
    Residual |   11780.211       466  25.2794228   R-squared       =    0.4592
-------------+----------------------------------   Adj R-squared   =    0.4569
       Total |  21784.8113       468   46.548742   Root MSE        =    5.0279

------------------------------------------------------------------------------
       total |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   elevation |   .0051263   .0004693    10.92   0.000     .0042041    .0060484
    latitude |   3.313543   .2761122    12.00   0.000     2.770963    3.856122
       _cons |  -136.9432   11.62777   -11.78   0.000    -159.7926   -114.0939
------------------------------------------------------------------------------
margins, at(latitude=(40.54(1)44.84))
Predictive margins                              Number of obs     =        469
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : latitude        =       40.54

2._at        : latitude        =       41.54

3._at        : latitude        =       42.54

4._at        : latitude        =       43.54

5._at        : latitude        =       44.54

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   1.318498   .5947581     2.22   0.027     .1497579    2.487238
          2  |    4.63204   .3571999    12.97   0.000     3.930118    5.333962
          3  |   7.945583   .2322118    34.22   0.000     7.489271    8.401895
          4  |   11.25913   .3643196    30.90   0.000     10.54321    11.97504
          5  |   14.57267   .6033334    24.15   0.000     13.38708    15.75826
------------------------------------------------------------------------------
marginsplot
  Variables that uniquely identify margins: latitude
quietly graph export marginsplot_latitude.png, replace
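Each predicted margin above is simply the fitted regression evaluated at the specified latitude, with elevation averaged over the sample (about 766.8 ft after the -9999 recode). For the third row, latitude 42.54:

-136.943 + 0.0051263 × 766.8 + 3.31354 × 42.54 ≈ 7.95

which reproduces the 7.9456 in the Margin column.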

Marginal Effect of Latitude

Marginal Effect of Elevation

margins, at(elevation=(0(100)2500))
Predictive margins                              Number of obs     =        469
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : elevation       =           0

2._at        : elevation       =         100

3._at        : elevation       =         200

4._at        : elevation       =         300

5._at        : elevation       =         400

6._at        : elevation       =         500

7._at        : elevation       =         600

8._at        : elevation       =         700

9._at        : elevation       =         800

10._at       : elevation       =         900

11._at       : elevation       =        1000

12._at       : elevation       =        1100

13._at       : elevation       =        1200

14._at       : elevation       =        1300

15._at       : elevation       =        1400

16._at       : elevation       =        1500

17._at       : elevation       =        1600

18._at       : elevation       =        1700

19._at       : elevation       =        1800

20._at       : elevation       =        1900

21._at       : elevation       =        2000

22._at       : elevation       =        2100

23._at       : elevation       =        2200

24._at       : elevation       =        2300

25._at       : elevation       =        2400

26._at       : elevation       =        2500

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |          .  (not estimable)
          2  |   5.984411   .5999464     9.97   0.000     4.805462    7.163359
          3  |    5.60585   .4222693    13.28   0.000     4.776053    6.435648
          4  |   5.453481   .3367075    16.20   0.000     4.791821    6.115142
          5  |   5.502861   .3223978    17.07   0.000      4.86932    6.136402
          6  |   5.729548   .3373162    16.99   0.000     5.066691    6.392405
          7  |     6.1091   .3529676    17.31   0.000     5.415487    6.802713
          8  |   6.617076   .3605384    18.35   0.000     5.908586    7.325567
          9  |   7.229034   .3614414    20.00   0.000     6.518769    7.939299
         10  |   7.920532   .3610039    21.94   0.000     7.211127    8.629937
         11  |   8.667128   .3648421    23.76   0.000     7.950181    9.384076
         12  |   9.444381   .3762415    25.10   0.000     8.705032    10.18373
         13  |   10.22785   .3949241    25.90   0.000     9.451787    11.00391
         14  |   10.99309    .417921    26.30   0.000     10.17184    11.81434
         15  |   11.71566   .4419586    26.51   0.000     10.84717    12.58415
         16  |   12.37112   .4662718    26.53   0.000     11.45486    13.28739
         17  |   12.93503     .49549    26.11   0.000     11.96135    13.90871
         18  |   13.38294   .5421034    24.69   0.000     12.31766    14.44823
         19  |   13.69042   .6260242    21.87   0.000     12.46023    14.92062
         20  |   13.83302   .7685783    18.00   0.000      12.3227    15.34335
         21  |    13.7863   .9852521    13.99   0.000      11.8502    15.72241
         22  |   13.52582   1.284649    10.53   0.000     11.00137    16.05027
         23  |   13.02714   1.671993     7.79   0.000     9.741524    16.31276
         24  |   12.26581   2.151945     5.70   0.000     8.037047    16.49458
         25  |    11.2174   2.729625     4.11   0.000     5.853439    16.58136
         26  |   9.857454   3.410736     2.89   0.004     3.155052    16.55986
------------------------------------------------------------------------------
marginsplot
  Variables that uniquely identify margins: elevation
quietly graph export marginsplot_elevation.png, replace

Marginal Effect of Elevation

Plot by Date in New Hartford, NY

//reshape data to long
reshape long jan, i(ghcnid) j(date) string
(note: j = 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30
>  31 4 5 6 7 8 9)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      469   ->   14539
Number of variables                  39   ->      10
j variable (31 values)                    ->   date
xij variables:
                    jan1 jan10 ... jan9   ->   jan
-----------------------------------------------------------------------------
rename jan inches
//destring date
destring date, replace
date has all characters numeric; replaced as byte
line inches date if stationname == "NEW HARTFORD 0.8 S"
quietly graph export line_08S.png, replace

Line Graph of New Hartford 0.8 S

scatter inches date if stationname == "NEW HARTFORD 1.0 WSW"
quietly graph export scatter_10WSW.png, replace

Scatter Plot of New Hartford 1.0 WSW

twoway (scatter inches date if stationname == "NEW HARTFORD 0.8 S") (line inches date if stationname == "NEW HARTFORD 0.8 S")
quietly graph export scatter_line_08S.png, replace

Scatter Plot and Line Graph of New Hartford 0.8 S

R

General Information

Remember

  • use R scripts
  • back up, back up, and then back up
  • CRAN, Google, and ChatGPT are your friends, when you use them right!
  • check data structure and types when you hit an error
  • Kaggle

You can download the R script here:

Download data_science_for_hamiltonians.R

Snowfall Data

####Load and Reshape Data####
# Attach libraries
library(tidyverse)
####get total snowfall by GHCN####
# Load the CSV file of snowfall
# Read the csv, skipping the first row so that the second row becomes the header
snowfall <- read.csv("NY-snowfall-202301.csv", skip = 1, header = TRUE)
# Replace "M" (missing) and "T" (trace) with "0" in the snowfall columns
# (restrict the replacement to the Jan. columns so that station names,
#  which also contain the letters M and T, are left untouched)
snowfall <- snowfall %>% 
  mutate(across(starts_with("Jan."), ~ str_replace_all(., c("M" = "0", "T" = "0"))))
# Check data type
snowfall$Jan.1 %>% typeof()
[1] "character"
# Convert columns starting with Jan. to numeric
snowfall <- snowfall %>% 
  mutate(across(starts_with("Jan."), as.numeric))
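One caveat: as.numeric() coerces any string it cannot parse to NA (with a coercion warning), so it is worth confirming that the M/T cleaning caught everything. A minimal check, assuming the converted snowfall dataframe from above:

```r
# Count NAs introduced by the conversion; every column should report 0
colSums(is.na(select(snowfall, starts_with("Jan."))))
```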
# Add a new column with the sum of each row
snowfall <- snowfall %>% 
  mutate(total = rowSums(select(., starts_with("Jan."))))

Interactive Graph of Top 15

####Produce an interactive graph of top 15####
# Attach libraries
library(ggplot2)
library(plotly)
# Select top 15 stations and reorder factor levels
top_stations <- snowfall %>% 
  top_n(15, total) %>% 
  mutate(Station.Name = fct_reorder(Station.Name, desc(total))) %>% 
  rename(Station = Station.Name, Snowfall = total)
# Create ggplot object with custom color scheme
p <- ggplot(top_stations, aes(x = Station, y = Snowfall,  text = paste("Snowfall: ", Snowfall))) +
  geom_bar(stat = "identity", aes(fill = Snowfall)) +
  scale_fill_gradient(low = "grey", high = "blue") +
  labs(title = "Top 15 Stations with the Most Snowfall in January",
       x = "Station",
       y = "Total January Snowfall (in)") +
  theme_bw() +
  labs(fill = "Total\nSnowfall") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

# Convert ggplot to interactive plot
ggplotly(p + theme(legend.position = "none"), tooltip = "text", 
         hovertemplate = paste("<b>%{x}</b><br>",
                               "Snowfall: %{y}<br>")) %>%
  layout(hoverlabel = list(bgcolor = c(low = "grey", high = "blue")))

Interactive Graph by Date

####Produce an interactive graph by date####
# Attach libraries
library(ggplot2)
library(plotly)
# Keep only the station name and the daily snowfall columns
new_snowfall <- snowfall %>% select(Station.Name, starts_with("Jan."))
# tidy
tidy_snowfall <- new_snowfall %>% 
  gather(key = "date", value = "Snowfall", -Station.Name) 
# Build a Date column: "Jan.D" -> "2023-01-D"
tidy_snowfall$Date <- as.Date(paste0("2023-", sub("Jan\\.", "01-", tidy_snowfall$date)))
tidy_snowfall$Station <- tidy_snowfall$Station.Name
# Create ggplot object
ggp <- ggplot(tidy_snowfall, aes(x = Date, y = Snowfall, color = Station)) +
  geom_line() +
  labs(title = "Snowfall by Stations",
       x = "Date",
       y = "Snowfall (in)") +
  theme_bw() +
  theme(legend.position = "none") +
  scale_x_date(date_labels = "%b/%d")
ggp

# Convert to plotly object for interactivity
ggplotly(ggp)

Interactive Map

####Produce an interactive map of snowfall by NYS assembly districts####
# Attach libraries
library(sf)
library(tmap)
library(leaflet)
# Read in the shapefile for the New York State Assembly Districts
ad_shapefile <- st_read("https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
Reading layer `OGRGeoJSON' from data source 
  `https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson' 
  using driver `GeoJSON'
Simple feature collection with 150 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -79.76259 ymin: 40.47658 xmax: -71.77749 ymax: 45.01586
Geodetic CRS:  WGS 84
# Create a spatial points dataframe
snowfall_sf <- st_as_sf(snowfall, coords = c("Longitude", "Latitude"), crs = 4326)

# Perform spatial join
snowfall_by_district <- st_join(snowfall_sf, ad_shapefile, join = st_intersects)

# Summarize snowfall by district
snowfall_summarized <- snowfall_by_district %>% 
  group_by(District) %>% 
  summarize(total_snowfall = sum(total))

# Count the number of observatories in each district
observatory_counts <- snowfall_by_district %>%
  group_by(District) %>%
  summarize(observatory_count = n())

# Add the observatory count to the snowfall_summarized data frame
snowfall_summarized <- st_join(snowfall_summarized, observatory_counts)

# Calculate the average snowfall for each district
snowfall_averages <- snowfall_summarized %>%
  mutate(avg_snowfall = round(total_snowfall / observatory_count, 2)) %>% # create a new column with average snowfall, displayed to two decimal places
  select(-District.y) %>%
  rename(District = District.x)

# Join the district-level snowfall averages back onto the assembly-district shapefile
nyassembly <- st_join(ad_shapefile, snowfall_averages) %>% 
  select(-c(District.y, District.x)) %>%
  mutate(OBJECTID = OBJECTID - 150)
# Define the color palette
pal <- leaflet::colorNumeric(palette = c("grey", "blue"), domain = nyassembly$avg_snowfall)

# Create the leaflet map
leaflet <- leaflet() %>%
  addTiles() %>% 
  setView(lng = -75.3, lat = 43, zoom = 6.7) %>% 
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = nyassembly, 
              fillColor = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)), 
              weight = 0.3,
              color = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)),  
              opacity = 1.0,
              fillOpacity = 0.5, 
              label = ~paste("District: ", OBJECTID, ", ",
                             "Average Snowfall: ", format(round(avg_snowfall, 2), nsmall = 2)),
              highlightOptions = leaflet::highlightOptions(
                weight = 1.5,
                color = "white",
                fillOpacity = 0,
                bringToFront = TRUE)) %>%
  addLegend("bottomleft", 
            pal = pal, 
            values = nyassembly$avg_snowfall,
            title = "Average Snowfall", 
            opacity = 0.8) %>%
  addScaleBar(position = "topright")
leaflet
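Unlike the Stata graphs, the leaflet map is an HTML widget rather than a static image, so graph-export has no direct equivalent; the htmlwidgets package (installed alongside leaflet) can save it instead. A sketch, reusing the `leaflet` object created above, with a hypothetical output filename:

```r
library(htmlwidgets)
# Save the interactive map as a self-contained HTML file you can share or embed
saveWidget(leaflet, "nys_snowfall_map.html", selfcontained = TRUE)
```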

Linear Regression

# Check data type
is.numeric(snowfall$Elevation)
[1] FALSE
is.numeric(snowfall$Latitude)
[1] FALSE
# Correct data type
snowfall$Elevation <- as.numeric(snowfall$Elevation)
snowfall$Latitude <- as.numeric(snowfall$Latitude)
# Linear regression
model <- lm(total ~ Elevation + Latitude, data = snowfall)
summary(model)

Call:
lm(formula = total ~ Elevation + Latitude, data = snowfall)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.9577  -2.7186   0.2336   2.4818  27.4458 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.582e+02  1.202e+01 -13.162  < 2e-16 ***
Elevation    2.530e-03  3.506e-04   7.216 2.18e-12 ***
Latitude     3.861e+00  2.839e-01  13.601  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.344 on 466 degrees of freedom
Multiple R-squared:  0.3891,    Adjusted R-squared:  0.3864 
F-statistic: 148.4 on 2 and 466 DF,  p-value: < 2.2e-16
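One reason these estimates differ from the Stata results is that the R code never recoded the -9999 elevation sentinel that the Stata section replaced before regressing. Refitting after the same recode (reusing the 900 ft value from the Stata section) makes the two models comparable:

```r
# Recode the -9999 sentinel as in the Stata section, then refit
snowfall$Elevation[snowfall$Elevation == -9999] <- 900
model <- lm(total ~ Elevation + Latitude, data = snowfall)
summary(model)
```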